feat(mlx-grpc): support string stop sequences for chat and completion #1447

zach-li-sudo wants to merge 9 commits into
Conversation
📝 Walkthrough

This PR adds MLX stop-string handling: encode user stop strings to token IDs during request build, store stop token IDs in MLX sampling params, and resolve MLX matched-stop token IDs back to user-facing values during response processing; includes tests and a failing tokenizer mock.

Changes

MLX Stop-Sequence Support
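The "failing tokenizer mock" mentioned in the walkthrough is a test utility for exercising the encode-error path. A minimal sketch of what such a mock can look like — the struct and method names here are illustrative, not the PR's actual crates/tokenizer/src/mock.rs code:

```rust
/// Test-only tokenizer whose encode always fails, useful for checking that
/// stop-string resolution degrades gracefully when encoding errors occur.
pub struct FailingTokenizer;

impl FailingTokenizer {
    pub fn encode(&self, _text: &str) -> Result<Vec<u32>, String> {
        Err("mock tokenizer: encode always fails".to_string())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn encode_always_errors() {
        assert!(FailingTokenizer.encode("6").is_err());
    }
}
```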
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4ce20db09
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 83-99: The code calls resolve_mlx_stop_ids(stop, tokenizer) before
verifying the request is the MLX variant, which can tokenize unnecessarily and
produce spurious errors; change the order so you first match on
ProtoGenerateRequest::Mlx (e.g., if let ProtoGenerateRequest::Mlx(req) =
proto_request { ... }), bail early if not MLX, then check for Some(stop) and
only then call resolve_mlx_stop_ids(stop, tokenizer) and extend
sampling.stop_token_ids; ensure you still return Ok(()) when stop is None or
when sampling_params is missing.
In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 741-747: The MLX variant's matched_stop_json() can return raw
integer token IDs; update the five unguarded call sites
(process_non_streaming_generate_response,
process_non_streaming_messages_response, process_non_streaming_chat_response
(harmony), the Harmony streaming response processing, and the Harmony streaming
variant) so they do not consume raw matched_stop_json() directly: either guard
with is_mlx() and call resolve_mlx_matched_stop_json() for Mlx, or always call
resolve_mlx_matched_stop_json() (which uses mlx_matched_stop_token_id()) before
using the value; ensure any code paths that previously read matched_stop_json()
now receive the resolved string form.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 422-453: Update the docstring for stop_strings_to_token_ids to
document that tokenizer.encode(...) errors are not propagated but are logged and
the corresponding stop string is skipped (i.e., both zero-token encodings and
encoder errors are warn-and-skipped), and clarify that only multi-token
encodings produce an Err result; reference the function name
stop_strings_to_token_ids and the call tokenizer.encode to make clear where this
behavior occurs.
- Around line 496-508: apply_mlx_stop_sequences currently extends
sampling.stop_token_ids with values converted by resolve_mlx_stop_ids without
checking if the request already set explicit stop_token_ids; add a validation
that rejects the request when both stop strings and explicit stop_token_ids are
provided. To fix, change apply_mlx_stop_sequences (or its caller) to receive the
original request's stop_token_ids (or a boolean flag) and if that vector is
non-empty and stop_strings is present, return a bad_request error; alternatively
perform this check in the caller before invoking apply_mlx_stop_sequences.
Ensure the error originates from the same error::bad_request pattern and
reference sampling.stop_token_ids, apply_mlx_stop_sequences, and
resolve_mlx_stop_ids in your change.
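A hedged sketch of the resolve-before-use guard the proto_wrapper.rs comment asks for. The wrapper enum and context type below are stand-ins; only the intent comes from the comment: token IDs that originated as user stop strings resolve back to strings, while explicit stop token IDs stay numeric (as the manual tests later in this thread also show — matched_stop: "6" versus matched_stop: 20):

```rust
use std::collections::HashMap;
use serde_json::Value;

/// Stand-in for the request-side table built when stop strings were encoded:
/// stop token ID -> original user-facing stop string.
struct StopContext {
    id_to_stop: HashMap<u32, String>,
}

enum ResponseWrapper {
    Mlx { matched_stop_token_id: Option<u32> },
    Other { matched_stop: Option<Value> },
}

impl ResponseWrapper {
    /// Resolved accessor: MLX raw token IDs are mapped back to the stop
    /// string the user sent; IDs the user supplied directly stay numeric.
    fn resolved_matched_stop(&self, ctx: &StopContext) -> Option<Value> {
        match self {
            ResponseWrapper::Mlx { matched_stop_token_id } => matched_stop_token_id
                .map(|id| match ctx.id_to_stop.get(&id) {
                    Some(s) => Value::String(s.clone()),
                    None => Value::Number(id.into()),
                }),
            ResponseWrapper::Other { matched_stop } => matched_stop.clone(),
        }
    }
}
```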
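For the chat_utils.rs docstring comment, a sketch of the contract being documented. stop_strings_to_token_ids is the PR's function name; the signature and body here are simplified stand-ins:

```rust
/// Encode user stop strings into single stop-token IDs.
///
/// Error handling (the behavior the review asks to document):
/// - tokenizer encode errors are NOT propagated: the failure is logged and
///   that stop string is skipped;
/// - zero-token encodings are likewise warn-and-skipped;
/// - only multi-token encodings produce an Err, since a single stop token ID
///   cannot represent them.
fn stop_strings_to_token_ids(
    stops: &[String],
    encode: impl Fn(&str) -> Result<Vec<u32>, String>, // stand-in for tokenizer.encode
) -> Result<Vec<u32>, String> {
    let mut ids = Vec::new();
    for s in stops {
        match encode(s) {
            Err(e) => eprintln!("warn: skipping stop {s:?}: {e}"), // logged, not propagated
            Ok(t) if t.is_empty() => eprintln!("warn: skipping stop {s:?}: empty encoding"),
            Ok(t) if t.len() == 1 => ids.push(t[0]),
            Ok(t) => {
                return Err(format!(
                    "stop string {s:?} encodes to {} tokens; only single-token stop strings are supported",
                    t.len()
                ))
            }
        }
    }
    Ok(ids)
}
```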
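And a sketch of the validation the second chat_utils.rs comment proposes. Note this reflects the reviewer's suggestion, not merged behavior — the manual tests later in this thread show combined inputs being accepted:

```rust
/// Reviewer-proposed guard: reject requests that set both string stops and
/// explicit stop_token_ids, since both funnel into the same
/// sampling.stop_token_ids list and the combination is ambiguous.
fn validate_stop_inputs(
    stop_strings: Option<&[String]>,
    explicit_stop_token_ids: &[u32],
) -> Result<(), String> {
    let has_strings = stop_strings.map_or(false, |s| !s.is_empty());
    if has_strings && !explicit_stop_token_ids.is_empty() {
        // Mirrors the error::bad_request pattern the comment references.
        return Err("bad_request: `stop` strings and explicit `stop_token_ids` cannot be combined".into());
    }
    Ok(())
}
```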
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8297696e-30a4-4e03-9896-ce3f2cb893d9
📒 Files selected for processing (13)
Cargo.toml
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/Cargo.toml
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
Thanks for the contribution, Zhuo! The feature itself is small and the tests are nice, but the diff feels heavier than it needs to be because backend-type branching is leaking into orchestration code. A few things to consider before merge:
Hi @zach-li-sudo, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7e1728cf1
♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/common/stages/helpers.rs (1)
283-305: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Optimize by checking MLX variant before tokenization.

The current order (check stop, tokenize, then check MLX variant) can waste CPU tokenizing stops for non-MLX backends. Checking the MLX variant first avoids unnecessary tokenization:

♻️ Proposed reordering

```diff
 pub(crate) fn apply_mlx_stop_sequences(
     proto_request: &mut ProtoGenerateRequest,
     stop: Option<&StringOrArray>,
     tokenizer: Option<&dyn Tokenizer>,
 ) -> Result<(), Response> {
+    let ProtoGenerateRequest::Mlx(req) = proto_request else {
+        return Ok(());
+    };
     let Some(stop) = stop else {
         return Ok(());
     };
-
-    if let ProtoGenerateRequest::Mlx(req) = proto_request {
-        let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
-        let sampling = req.sampling_params.as_mut().ok_or_else(|| {
-            error::internal_error(
-                "mlx_sampling_params_missing",
-                "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
-            )
-        })?;
-        sampling.stop_token_ids.extend(token_ids);
-    }
-
-    Ok(())
+    let token_ids = resolve_mlx_stop_ids(stop, tokenizer)?;
+    let sampling = req.sampling_params.as_mut().ok_or_else(|| {
+        error::internal_error(
+            "mlx_sampling_params_missing",
+            "MLX GenerateRequest has no sampling_params; cannot inject stop IDs",
+        )
+    })?;
+    sampling.stop_token_ids.extend(token_ids);
+    Ok(())
 }
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/common/stages/helpers.rs` around lines 283-305: apply_mlx_stop_sequences currently tokenizes stop sequences via resolve_mlx_stop_ids before confirming the proto_request is the MLX variant, causing unnecessary CPU work for non-MLX backends; modify apply_mlx_stop_sequences to first early-return if proto_request is not ProtoGenerateRequest::Mlx, then check the stop Option and call resolve_mlx_stop_ids only inside the MLX branch, and finally mutate sampling_params (sampling.stop_token_ids.extend(...)) as before to avoid wasted tokenization for non-MLX requests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@model_gateway/src/routers/grpc/common/stages/helpers.rs`:
- Around line 283-305: The function apply_mlx_stop_sequences currently tokenizes
stop sequences via resolve_mlx_stop_ids before confirming the proto_request is
the MLX variant, causing unnecessary CPU work for non-MLX backends; modify
apply_mlx_stop_sequences to first early-return if proto_request is not
ProtoGenerateRequest::Mlx, then proceed to check the stop Option and call
resolve_mlx_stop_ids only when inside the MLX branch, and finally mutate
sampling_params (sampling.stop_token_ids.extend(...)) as before to avoid wasted
tokenization for non-MLX requests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a1ca32dd-17bc-490f-9af2-dc65c5b82612
📒 Files selected for processing (11)
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
…lightseekorg#1099) Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
…d no-ops on non-MLX Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
Signed-off-by: Zhuo Li <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fac846fbd6
Manual testing results: stop sequence with MLX backend

1. SMG with MLX backend setup
pip install -e grpc_servicer/
source .venv/bin/activate && python -m smg_grpc_servicer.mlx.server \
--model mlx-community/Qwen3-4B-Instruct-2507-4bit --port 50051
./target/debug/smg --worker-urls grpc://localhost:50051 --port 3000

2. Testing scenarios

This PR supports string stop arrays in the regular chat and completion pipelines.

Key differences from vLLM
The gateway converts single-token stop strings to token IDs for MLX requests (the MLX proto accepts stop token IDs but has no string stop field).

Token ID reference

Token IDs used in the tests below assume the Qwen3 tokenizer, which shares vocabulary with Qwen2.5:
Verify with the tokenizer if results are unexpected:

from mlx_lm import load
_, tokenizer = load("mlx-community/Qwen3-4B-Instruct-2507-4bit")
print(tokenizer.encode("5 6"))  # check IDs for context-free "5" and "6"

3. Results

Test matrix: 4 paths × 5 stop modes = 20 cases.
3.1 Chat, non-stream

3.1.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e12f3-af39-7bc1-a46c-4173bebe0cbc",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n5 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 11,
"total_tokens": 32
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Say: hi there and hello world!"}],
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.1.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "chatcmpl-019e12f3-b06a-7e30-9026-b277ef4cd022",
"object": "chat.completion",
"created": 1778434420,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
}

3.1.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "chatcmpl-019e13c9-0ea1-77c0-b430-bf53fbcdede2",
"object": "chat.completion",
"created": 1778448404,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1 \n2 \n3 \n4 \n",
"reasoning_content": null
},
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 21,
"completion_tokens": 9,
"total_tokens": 30
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.1.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.2 Chat, stream

3.2.1 — stop string, single token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.2.2 — stop string, multi-token

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Repeat exactly: 1 2 3 hello world 4 5"}],
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.2.3 — stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.2.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.2.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/chat/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"messages": [{"role": "user", "content": "Count from 1 to 10, one number per line"}],
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.3 Completion, non-stream

3.3.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e12f4-9eda-71c2-bf57-ab5fca3036f6",
"object": "text_completion",
"created": 1778434481,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "5\n",
"index": 0,
"finish_reason": "stop",
"matched_stop": "6"
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 3,
"total_tokens": 25
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.3.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq

{
"id": "cmpl_019e12f4-9faf-7582-bcc0-bb7849cf6585",
"object": "text_completion",
"created": 1778434482,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
}
3.3.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"id": "cmpl_019e13c9-6957-7f10-8038-53f7c215dd87",
"object": "text_completion",
"created": 1778448427,
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"choices": [
{
"text": "",
"index": 0,
"finish_reason": "stop",
"matched_stop": 20
}
],
"usage": {
"prompt_tokens": 22,
"completion_tokens": 1,
"total_tokens": 23
},
"system_fingerprint": "default"
},
"status": "HTTP 200"
}

3.3.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": false,
"max_tokens": 100
}' | jq -Rs 'split("\n") | {response: .[0] | fromjson, status: .[-1]}'

{
"response": {
"error": {
"type": "Bad Request",
"code": "unsupported_stop_string",
"message": "stop string \"hello world\" encodes to 2 tokens; MLX backend only supports single-token stop strings",
"param": null
}
},
"status": "HTTP 400"
}

3.4 Completion, stream

3.4.1 — stop string, single token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stream": true,
"max_tokens": 100
}'

3.4.2 — stop string, multi-token

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Repeat exactly: 1 2 3 hello world 4 5",
"stop": ["hello world"],
"stream": true,
"max_tokens": 100
}'

3.4.3 — stop_token_ids

curl http://localhost:3000/v1/completions -s -N \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'

3.4.4 — combined: single-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["6"],
"stop_token_ids": [20],
"stream": true,
"max_tokens": 100
}'

3.4.5 — combined: multi-token stop string + stop_token_ids

curl http://localhost:3000/v1/completions -s -N -w "\nHTTP %{http_code}" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen3-4B-Instruct-2507-4bit",
"prompt": "Count from 1 to 10, one number per line:\n1\n2\n3\n4\n",
"stop": ["hello world"],
"stop_token_ids": [20, 21],
"stream": true,
"max_tokens": 100
}'
Hi Keyang, thanks for your comments! The issues you've mentioned are resolved as follows:
Other items:
Solution: used
Yes, the overall test plan was executed on my local Mac with real responses; all test cases are summarized in the test matrix above. I see some other gaps (not directly related to this PR) in MLX backend support, like the Harmony paths and the messages/generate endpoints. I will discuss these after further investigation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ddd72c071
```rust
    .matched_stop_token_id
    .map(|id| serde_json::Value::Number(id.into())),
// MLX requires request context to resolve the token ID; use matched_stop_json_with_context.
Self::Mlx(_) => unreachable!("matched_stop_json called for MLX backend"),
```
Keep MLX matched-stop lookup non-panicking for Harmony
Changing matched_stop_json() to unreachable!() for Self::Mlx now makes active Harmony MLX paths crash at runtime, because those flows still call the old method (model_gateway/src/routers/grpc/harmony/processor.rs:62 and .../harmony/streaming.rs:299) instead of matched_stop_json_with_context. Any Harmony request routed to MLX that reaches a Complete frame will panic rather than returning a response, so this should remain non-panicking until all Harmony call sites are migrated.
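A minimal sketch of the non-panicking shape Codex is asking for: the MLX arm surfaces the raw token ID (or None) instead of unreachable!(), so un-migrated Harmony call sites degrade to numeric matched_stop values rather than crashing. The free function below is illustrative, not the PR's actual method:

```rust
use serde_json::Value;

/// Non-panicking fallback for the MLX arm: return the raw token ID instead
/// of panicking, so Harmony paths that still call the context-free accessor
/// keep working until they migrate to matched_stop_json_with_context.
fn mlx_matched_stop_json(matched_stop_token_id: Option<u32>) -> Option<Value> {
    matched_stop_token_id.map(|id| Value::Number(id.into()))
}

fn main() {
    // A legacy Harmony call site gets Some(20) rather than a panic.
    assert_eq!(
        mlx_matched_stop_json(Some(20)),
        Some(Value::Number(20u32.into()))
    );
}
```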
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/grpc_client/src/mlx_engine.rs`:
- Around line 245-246: Update the inline comment that currently says "Messages
and Generate pipelines still reject string stops" to list all three pipelines
that reject string stops (Messages, Generate, and Responses). Locate the comment
near the top of mlx_engine.rs (the block describing stop-sequence support) and
amend the sentence to explicitly include "Responses" alongside "Messages" and
"Generate", and ensure it references the existing reject_stop_strings check used
in the Responses handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34dfd186-7722-4fc6-a55b-bb7feb896de4
📒 Files selected for processing (11)
crates/grpc_client/src/mlx_engine.rs
crates/protocols/src/completion.rs
crates/tokenizer/src/mock.rs
model_gateway/src/routers/grpc/common/stages/helpers.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/routers/grpc/utils/mod.rs
```rust
// - String stop sequences: supported in chat and completion pipelines.
//   Messages and Generate pipelines still reject string stops (see reject_stop_strings).
```
Update the comment to mention all three pipelines that still reject string stops.
The comment states that "Messages and Generate pipelines still reject string stops," but the Responses pipeline (line 422) also retains the reject_stop_strings check. For completeness, the comment should list all three.
📝 Proposed fix to make the documentation complete
- // - String stop sequences: supported in chat and completion pipelines.
- // Messages and Generate pipelines still reject string stops (see reject_stop_strings).
+ // - String stop sequences: supported in chat and completion pipelines.
+ // Messages, Generate, and Responses pipelines still reject string stops (see reject_stop_strings).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/grpc_client/src/mlx_engine.rs` around lines 245 - 246, Update the
inline comment that currently says "Messages and Generate pipelines still reject
string stops" to list all three pipelines that reject string stops (Messages,
Generate, and Responses). Locate the comment near the top of mlx_engine.rs (the
block describing stop-sequence support) and amend the sentence to explicitly
include "Responses" alongside "Messages" and "Generate", and ensure it
references the existing reject_stop_strings check used in the Responses
handling.
Description
Follow-up on #1099 to support chat/completion with the `stop` field.

Problem

Support string stop sequences for chat and completion.
Solution
Convert stop strings to stop token IDs before passing them to the MLX backend.
Changes
Test Plan
See the detailed curl commands and real responses in the manual testing comment above.
Checklist
- cargo +nightly fmt passes
- cargo clippy --all-targets --all-features -- -D warnings passes

Summary by CodeRabbit
New Features
Improvements